Abstract

Natural Language Processing (NLP) is concerned with processing ordinary, unrestricted 
text. This work takes a new approach to a traditional NLP task, using neural computing 
methods. A parser which has been successfully implemented is described. It is a hybrid 
system, in which neural processors operate within a rule based framework. 

The neural processing components belong to the class of Generalized Single Layer Networks 
(GSLN). In general, supervised, feed-forward networks need more than one layer to process
data. However, in some cases data can be pre-processed with a non-linear transformation, 
and then presented in a linearly separable form for subsequent processing by a single layer 
net. Such networks offer advantages of functional transparency and operational speed. 

For our parser, the initial stage of processing maps linguistic data onto a higher order 
representation, which can then be analysed by a single layer network. This transformation is 
supported by information theoretic analysis. 

Three different algorithms for the neural component were investigated. Single layer nets 
can be trained by finding weight adjustments based on (a) factors proportional to the input, 
as in the Perceptron, (b) factors proportional to the existing weights, and (c) an error
minimization method. In our experiments generalization ability varies little; method (b) is used
for a prototype parser. This is available via telnet. 

Keywords: Single layer networks, sequential data, natural language, de-coupled training 

1 Introduction 



This paper examines some of the issues that have to be addressed in designing neural processors 
for discrete, sequential data. There is a mutual dependence between the representation of data, 
on the one hand, and the architecture and function of an effective network on the other. As 
a vehicle for examining these processes we describe an automated partial parser that has been 
successfully developed [1, 2]. This takes natural language sentences and returns them with the
subject and head of the subject located. Ability to generalize is the primary concern. A prototype
can be accessed via telnet, on which text can be entered and then parsed. Intermediate steps in
the process can be seen.^1

In principle, simpler networks with well understood functions have prima facie advantages,
so looking for a representation that enables such networks to be used should be advantageous.
With feed forward, supervised networks, single layer models enjoy functional transparency and
operational speed, but in general this type of network will need more than one dynamically linked
layer to model non-linear relationships.

^1 For details, contact author.

However, there is an alternative approach. The layers may be de-coupled, and processing at 
different layers done in separate steps. Data can be transformed, which is analogous to processing 
at the first layer, and then presented in a linearly separable form to a single layer net, which is 
analogous to a second layer. This is illustrated in Figure 4, which shows in simplified form
an archetype of the class of Generalized Single Layer Networks (GSLN). A number of different 
network types that belong to this class are listed by Holden and Rayner [^, page 369] . The critical 
issue is finding an appropriate non-linear transformation to convert data into the required form. 

This paper describes how characteristic linguistic data can be converted into a linearly
separable representation that partially captures sequential form. The transformed data is then
processed by a single layer network. Three different neural models are tried, and their performance
is compared.

All three networks are feed forward models with supervised training. Connection weights can
be found by adjustments based on (a) factors proportional to the input, (b) factors proportional to
the existing weights, and (c) factors related to the difference between desired and actual output,
an error minimization method. Model (a) is a traditional Perceptron; model (b) is based on
the Hodyne network introduced by Wyard and Nightingale at British Telecom Laboratories;
model (c) comes from the class of networks that use an LMS (Least Mean Square error) training
algorithm. There is little difference in generalization ability, but network (b) performs slightly
better and has been used for the parser in the prototype.

Natural language processing (NLP) 

The automatic parsing of natural language poses a significant problem, and neural computing 
techniques can contribute to its solution. For an overview of the scope for work in NLP see 
[^, pages 4-11]. Our prototype gives results of over 90% correct on declarative sentences from 
technical manuals (see Section ^).

Automated syntactic analysis of natural language has, in the last decade, been characterised 
by two paradigms. Traditional AI, rule based methods contrast with probabilistic approaches, 
in which stochastic models are developed from large corpora of real texts. Neural techniques fall 
into the broad category of the latter, data driven methods, with trainable models developed from 
examples of known parses. The parser we have implemented uses a hybrid approach: rule based 
techniques are integrated with neural processors. 

Parsing can be taken as a pattern matching task, in which a number of parses are postulated 
for some text. A classifier distinguishes between the desired parse and incorrect structures. 
The pattern matching capabilities of neural networks have a particular contribution to make to 
the process, since they can conveniently model negative as well as positive relationships. The
occurrence of some words or groups of words inhibits others from following, and these constraints
can be exploited. Arguments on the need for negative information in processing formal languages
can be extended to natural language. This is an important source of information which has
been difficult for traditional probabilistic methods to access [0, . Neural methods also have the
advantage that training is done in advance, so the run time computational load is low.

Contents of paper 

This paper will first take an overview of some factors that are relevant to neural net design
decisions (Section 2). It then looks at characteristics of natural language (Section 3) and the
representation of sequential data (Section 4). A description of the hybrid system used in our
work is given (Section 5). Then we examine some of the design issues for the neural components
of this system. First, the data itself is examined closely. Then we consider how the data can
be transformed for processing with a single layer net (Section 6). We also comment on the use
of a Bayesian classifier, which performs slightly less well than the neural networks. Section ^
describes the three different networks: (a) the Perceptron, (b) Hodyne and (c) an LMS model.

In Section |^ we compare the performance of the three networks. Generalization is good, 
providing that enough training data is used. Over 90% of the data is correctly classified, and the 
output can be interpreted so that results for the practical application are up to 100% correct. 
On the small amount of data processed so far the different networks have roughly comparable 
generalization ability, but the Hodyne model is slightly better. A discussion on the function of 
the net follows in Section ^. We conclude (Section |l^) that linguistic data is a suitable candidate
for processing with this approach. 

2 Neural nets as classifiers of different types of data 

2.1 "Clean" and noisy data 

Consider the fundamental difference in purpose between systems that handle noisy data, where 
it is desired to capture an underlying structure and smooth over some input, compared to those 
that process "clean" data, where every input datum may count. The many applications of neural
nets in areas such as image processing provide examples of the first type; the parity problem
is typical of the second.^2 These "clean" and "noisy" types can be considered as endpoints of a
spectrum, along which different processing tasks lie. In the case of noisy data a classifier will
be required to model significant characteristics in order to generalize effectively. The aim is to
model the underlying function that generates the data, so the training data should not be
overfitted.

On the other hand, for types of data such as inputs to a parity detector, no datum is noise. 
Consider an input pattern that is markedly different, that is topologically distant, from others 
in its class. For one type of data this may be noise. In other instances an "atypical" vector may 
not be noise, and we may need to capture the information it carries to fix the class boundary 
effectively. 

As we demonstrate in the next section, linguistic data needs to be analysed from both angles. 
We need to capture the statistical information on probable and improbable sequences of words; 
we also need to use the information from uncommon exemplars, which make up a very large 
proportion of natural language data. 

2.2 Preserving topological relationships 

Another of the characteristics that is relevant to network design is the extent to which the 
classification task, the mapping from input to output, preserves topological relationships. In 
many cases data which are close in input space are likely to produce the same output, and 
conversely similar classifications are likely to be derived from similar inputs. However, there 
are other classification problems which are different: a very small change in input may cause a 
significant change in output and, on the other hand, very different input patterns can belong to
the same class. Again, the parity problem is a paradigm example: in every case changing a single
bit in the input pattern changes the desired output.

^2 The classical parity problem takes a binary input vector, the elements of which are 0 or 1, and
classifies it as having an even or odd number of 1's.

2.3 Data distribution and structure 

Underlying data distribution and structure have their effect on the appropriate type of processor, 
and these characteristics should be examined. Information about the structure of linguistic data 
can be used to make decisions on suitable representations. In this work information theoretic 
techniques are used to support decisions on representation of linguistic data. 

We may also use information on data distribution to improve generalization ability. As shown 
in Section ^, assumptions of normality cannot be made for linguistic data. The distribution 
indicates that in order to generalize adequately the processor must capture information from a 
significant number of infrequent events. 

3 Characteristics of linguistic data 

The significant characteristics of natural language that we wish to capture include: 

• An indefinitely large vocabulary 

• The distinctive distribution of words and other linguistic data 

• A hierarchical syntactic structure 

• Both local and distant dependencies, such as feature agreement 



3.1 Vocabulary size 

Shakespeare is said to have used a vocabulary of 30,000 words, and even an inarticulate computer 
scientist might need 15,000 to get by (counting different forms of the same word stem separately). 
Current vocabularies for commercial speech processing databases are O(IO^). Without specifying 
an upper limit we wish to be able to model an indefinite number of words. 



3.2 The distinctive distribution of linguistic data 

The distribution of words in English and other languages has distinctive characteristics, to which 
Shannon drew attention [^. Statistical studies were made of word frequencies in English language 
texts. In about 250,000 words of text (word tokens) there were about 30,000 different words (word 
types), and the frequency of occurrence of each word type was recorded. If word types are ranked 
in order of frequency, and n denotes rank, then there is an empirical relationship between the 
probability of the word at rank n occurring, p(n), and n itself, known as Zipf's Law:

p(n) * n = constant

This gives a surprisingly good approximation to word probabilities in English and other
languages, and indicates the extent to which a significant number of words occur infrequently.
For example, words that have a frequency of less than 1 in 50,000 make up about 20-30% of
typical English language news-wire reports [10]. The LOB corpus^3, with about 1 million word
tokens, contains about 50,000 different word types, of which about 42,000 occur less than 10 times
each Q-

^3 The London Oslo Bergen corpus is a collection of texts used as raw material for natural language processing

The "zipfian" distribution of words has been found typical of other linguistic data. It is found 
again in the data derived from part-of-speech tags used to train the prototype described here: 
see Figures 2 and 3.
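As an illustration of the rank and frequency relationship, the following minimal Python sketch ranks word types by frequency and reports p(n) * n for each rank n. The function name `rank_probability_products` and the toy token list are assumptions made for this example only; they are not part of the system described here. Under Zipf's Law the products should be roughly constant.

```python
from collections import Counter

def rank_probability_products(tokens):
    """Return p(n) * n for each rank n, with word types ranked by frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # most_common() lists word types by descending frequency: rank 1 first.
    ranked = counts.most_common()
    return [(count / total) * rank
            for rank, (_, count) in enumerate(ranked, start=1)]

# Toy data constructed so that frequency is exactly proportional to 1/rank.
toy_tokens = (["the"] * 60 + ["of"] * 30 + ["pipe"] * 20
              + ["gearbox"] * 15 + ["cooler"] * 12)
products = rank_probability_products(toy_tokens)
```

On real text the products drift rather than staying exactly equal, but the long tail of rare types described above remains visible in the ranked counts.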

Other fields in which zipfian distribution is noted include information retrieval and data 
mining (e.g. characteristics of WWW use, patterns of database queries). It has also been observed 
in molecular biology (e.g. statistical characteristics of RNA landscapes, DNA sequence coding). 



Mapping words onto part-of-speech tags

In order to address the problem of sparse data the vocabulary can be partitioned into groups,
based on a similarity criterion, as is done in our system. An indefinitely large vocabulary is
mapped onto a limited number of part-of-speech tag classes. This also makes syntactic patterns
more pronounced. Devising optimal tagsets is a significant task, on which further work remains
to be done. For the purpose of this paper we take as given the tagsets used in the demonstration
prototype, described in [12]. At the stage of processing described in this paper 19 tags are used.



3.3 Grammatical structure 

There is an underlying hierarchical structure to all natural languages, a phenomenon that has been 
extensively explored. Sentences will usually conform to certain structural patterns, as is shown in 
a simplified form in Figure 1. This is not inconsistent with the fact that acceptable grammatical
forms evolve with time, and that people do not always express themselves grammatically. Text 
also, of course, contains non-sentential elements such as headings, captions, tables of contents. 
The work described in this paper is restricted to declarative sentences. 



[Figure 1 diagram: a tree rooted at SENTENCE, whose labelled nodes include pre-subject, predicate, head, main verb and phrase, with lower constituents marked "decompose further".]

Figure 1: Decomposition of the sentence into syntactic constituents. 



Within the grammatical structure there is an indefinite amount of variation in the way in 
which words can be assembled. On the other hand, the absence or presence of a single word can 
make a sentence unacceptable, for example 
*There are many problems arise. ... (1)
*We is late. ... (2)



Now consider how linguistic data fits into the scheme described in Section 2.2 on preserving 
topological relationships. Many strings of words that are close in input space are also in the same 
grammatical category, but, conversely, on occasions a single word change can put a string into a 
different category. Our processor has to model this. 



3.4 Local and distant dependencies 

In examining natural language we find there are dependencies between certain words, both locally 
and at a distance. For instance, a plural subject must have a plural verb, so a construction like 
sentence (2) above is incorrect. This type of dependency in its general form is not necessarily 
local. In sentence (3) below the number of the subject is determined by its head, which is not 
adjacent to the verb: 

If a cooler is fitted to the gearbox, [ the pipe [ connections ] of the cooler ] must
be regularly checked for corrosion. ... (3)

The subject of this sentence is the plural "connections". Note that modal verbs like "must" 
have the same singular and plural form in English, but not in many other languages. For an 
automated translation system to process modal verbs it is necessary to find the head of the 
subject that governs the verb and ensure number agreement. 

There are also dependencies between sentences and between more distant parts of a text. We 
aim to model just the intra-sentential dependencies as our automatic parser is developed. 



4 Modelling sequential data 



Three methods have commonly been used to model sequential data, such as language, for
connectionist processing. The first is to move a window along through the sequence, and process a
series of static "snapshots". Within each window ordering information is not represented.
Sejnowski's NETtalk is a well known example [13].

Another method that warrants further investigation is the use of recurrent nets [14, 15]. In its
basic form this type of network is equivalent to a finite state automaton that can model regular
languages [16].



4.1 The n-gram method 

The third method, used in this work, is to take sets of ordered, adjacent elements, which capture 
some of the sequential structure of language. This is related to the well known trigram approach 
used in probabilistic language processing. Combining tags into higher order tuples can also act 
as a pre-processing function, making it more likely that the transformed data can be processed 
by a single layer network (Section 6).

This method of representation captures some of the structure of natural language, as is shown 
by analysis with information theoretic techniques. There are relationships between neighbouring 
words in text: some are likely to be found adjacent, others are unlikely. When words are mapped 
onto part-of-speech tags this is also the case. This observation is supported by an investigation 
of entropy levels in the LOB corpus, in which 1 million words have been manually tagged. 
Entropy can be understood as a measure of uncertainty [17, chapter 2]. The uncertainty about
how a partial sequence will continue can be reduced when statistical constraints of neighbouring 
elements are taken into account. Shannon introduced this approach by analysing sequences of 
letters where the elements of a sequence are single letters, adjacent pairs or triples, with order 
preserved. These are n-grams, with n equal to 1, 2 or 3. The entropy of a sequence represented by 
letter n-grams declines as n increases. When sequences of tags in the LOB corpus were analysed 
the same result was obtained: the entropy of part-of-speech n-grams declines as n increases from 
1 to 3. This indicates that some of the structure of language is captured by taking tag pairs and 
triples as processing elements. 
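The kind of measurement described above can be sketched in Python. This is a minimal sketch under a stated assumption: the entropy of n-grams is read here as the conditional entropy of the last element given the preceding n-1 elements, which is one common reading of Shannon's n-gram analysis and may differ in detail from the measure used in the LOB investigation. The function names and the toy tag sequence are illustrative.

```python
import math
from collections import Counter

def ngrams(seq, n):
    """Ordered tuples of n adjacent elements."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def entropy(counts):
    """Shannon entropy, in bits, of the distribution given by the counts."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def conditional_entropy(seq, n):
    """Entropy of the last element of an n-gram, given the first n-1 elements."""
    grams = Counter(ngrams(seq, n))
    if n == 1:
        return entropy(grams)
    # Prefix counts are derived from the n-gram counts, so both
    # distributions are estimated from the same set of events.
    prefixes = Counter()
    for gram, count in grams.items():
        prefixes[gram[:-1]] += count
    return entropy(grams) - entropy(prefixes)

# Toy tag sequence (hypothetical): a strictly repeating pattern.
toy_tags = ["det", "noun", "verb"] * 10
f1 = conditional_entropy(toy_tags, 1)
f2 = conditional_entropy(toy_tags, 2)
```

In this toy sequence each tag fully determines its successor, so the uncertainty falls from log2(3) bits for single tags to zero for pairs. Real tag sequences show a smaller, gradual decline.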

We adopt the common approach of presenting data as binary vectors for all the networks 
examined in this work. Each element of the input vector represents an ordered tuple of adjacent 
part-of-speech tags, a pair or a triple. If a given tag tuple is present in an input string, then that 
element in the input vector is flagged to 1, else it remains 0. 



5 Description of the hybrid natural language processor 

In order to process unrestricted natural language it is necessary to attack the problem on a broad 
front, and use every possible source of information. In our work the neural networks are part 
of a larger system, integrated with rule based modules. We first assert that there is a syntactic 
structure which can be mapped onto a sentence (Figure 1). Then we use neural methods to find
the mapping in each particular case. The grammar used is defined in [12, chapter 5].



5.1 Problem decomposition 

In order to effect the mapping of this structure onto actual sentences we decompose the problem 
into stages, finding the boundaries of one syntactic feature at a time. The first step is to find 
the correct placement for the boundaries of the subject, then further features are found in the 3 
basic constituents. In the current prototype the head of the subject is subsequently identified. 
The processing at each stage is based on similar concepts, and to explain the role of the neural 
networks we shall in this paper discuss the first step in which the subject is found. 

The underlying principle employed each time is to take a sentence, or part of a sentence, and 
generate strings with the boundary markers of the syntactic constituent in question placed in all 
possible positions. Then a neural net selects the string with the correct placement. This is the 
grammatical, "yes" string for the sentence. The model is trained in supervised mode on marked 
up text to find this correct placement. The different networks that are examined share the same 
input and output routines, and each was integrated into the same overall system. 
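The generation step can be sketched as follows. This is a minimal Python sketch, not the system's own generator: it assumes a non-empty subject, an opening marker that always precedes the closing marker, and a single start symbol `strt`. The function name `candidate_strings` is hypothetical.

```python
from itertools import product

def candidate_strings(tag_options, start="strt"):
    """Enumerate tag strings with '[' and ']' in every possible position.

    tag_options holds, for each word, the list of tags the tagger allows.
    """
    candidates = []
    for tags in product(*tag_options):            # every tag disambiguation
        n = len(tags)
        for open_at in range(n):                  # '[' goes before word open_at
            for close_at in range(open_at + 1, n + 1):   # ']' goes before word close_at
                candidates.append(
                    [start, *tags[:open_at], "[", *tags[open_at:close_at],
                     "]", *tags[close_at:]]
                )
    return candidates

# Three words, the second of which is ambiguous between noun and verb.
options = [["pred"], ["noun", "verb"], ["aux"]]
strings = candidate_strings(options)
```

For three words with one two-way ambiguity this already yields 12 strings, which shows why the number of candidates grows quickly with sentence length and tag ambiguity.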



5.2 Tagging 

The first stage in both the training and testing process is to map an indefinite number of words
onto a limited number of part-of-speech tags. An automatic tagger allocates one or more
part-of-speech tags to the words to be processed. Many words, typically perhaps 25% to 30%, have
more than one tag. The CLAWS automatic tagger [18] provided a set of candidate tags for each
word, but the probabilistic disambiguation modules were not used: disambiguating the tags is
a sub-task for the neural processor. The CLAWS tagset was mapped onto a customised tagset
of 19, used for the work described here. Further information on tagset development is in [12,
chapter 4], and, briefly, in
5.3 Hypertags as boundary markers



As well as part of speech tags we also introduce the syntactic markers, virtual tags, which at this 
stage of the process will demarcate the subject boundary. These hypertags represent the opening 
'[' and closing ']' of the subject. The hypertags have relationships with their neighbours 
in the same way that ordinary tags do: some combinations are likely, some are unlikely. The 
purpose of the parser is to find the correct location of the hypertags. 

With a tagset of 19 parts-of-speech, a start symbol and 2 hypertags we have 22 tags in all.
Thus, there are potentially 22^2 + 22^3 = 11132 pairs and triples. In practice only a small proportion
of tuples are actually realised: see Tables 3 and 4. At other stages of the parsing process larger
tagsets are required (see ||T^).
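The count of potential pairs and triples for 22 tags can be checked directly. The short Python sketch below is only a worked version of that arithmetic; the variable names are illustrative.

```python
n_tags = 19 + 1 + 2          # parts-of-speech, start symbol, hypertags
n_pairs = n_tags ** 2        # ordered pairs of adjacent tags
n_triples = n_tags ** 3      # ordered triples of adjacent tags
n_tuples = n_pairs + n_triples
print(n_pairs, n_triples, n_tuples)   # prints: 484 10648 11132
```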



5.4 Rule based pruning: the Prohibition Table 

Strings can potentially be generated with the hypertags in all possible positions, in all possible
sequences of ambiguous tags. However, this process would produce an unmanageable amount of
data, so it is pruned by rule based methods integrated into the generation process. Applying
local and semi-local constraints, the generation of any string is abandoned if a prohibited feature
is produced. For fuller details see [12] or [1]. An example of a local prohibition is that the
adjacent pair (verb, verb) is not allowed. Of course (auxiliary verb, verb) is permissible,
as is (verb, ], verb). These rules are similar to those in a constraint grammar, but are
not expected to be comprehensive. There are also arbitrary length restrictions on the sentence
constituents: currently, the maximum length of the pre-subject is 15 words, of the subject 12
words.
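A check of this kind might look as follows in Python. This is a sketch under assumptions: the table here contains only the single prohibited pair quoted above, the real Prohibition Table is larger and is applied during generation rather than as a filter afterwards, and each string is assumed to begin with `strt` and to contain both hypertags. The names `PROHIBITED_PAIRS` and `is_allowed` are hypothetical.

```python
PROHIBITED_PAIRS = {("verb", "verb")}   # the one local prohibition quoted in the text
MAX_PRE_SUBJECT = 15                    # length limits quoted in the text, in words
MAX_SUBJECT = 12

def is_allowed(string):
    """Return False if the string contains a prohibited feature."""
    # Local constraint: no prohibited pair of adjacent elements.
    if any(pair in PROHIBITED_PAIRS for pair in zip(string, string[1:])):
        return False
    # Length constraints on the pre-subject and the subject.
    open_at, close_at = string.index("["), string.index("]")
    pre_subject_len = open_at - 1            # elements between 'strt' and '['
    subject_len = close_at - open_at - 1     # elements between '[' and ']'
    return pre_subject_len <= MAX_PRE_SUBJECT and subject_len <= MAX_SUBJECT
```

Because the hypertags are elements of the string, the sequence (verb, ], verb) contains no adjacent (verb, verb) pair and passes the local check, as the text requires.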



5.5 Neural processing 

This pruning operation is powerful and effective, but it still leaves a set of candidate strings for
each sentence, typically between 1 and 25 for the technical manuals. Around 25% of sentences are
left with a single string, but the rest can only be parsed using the neural selector. The number of
candidate strings averages about 3 per sentence for the technical manuals, more for sentences from
other domains. In training we manually identify the string with the correct placement of hypertags,
and the correctly disambiguated part-of-speech tags. In testing mode, the correct string is
selected automatically.
5.6 Coding the input

As an example of input coding consider a short sentence: 

All papers published in this journal are protected by copyright (4) 

(A) Map each word onto 1 or more tags

    all           predeterminer
    papers        noun or verb
    published     past-part-verb
    in            preposition or adverb
    this          pronominal determiner
    journal       noun
    are           auxiliary-verb
    protected     past-part-verb
    by            preposition
    copyright     noun
                  endpoint



(B) Generate strings with possible placement of subject boundary markers,
    and possible tag allocations (pruned).

string no. 1
    strt [ pred ] verb pastp prep prod noun aux pastp prep noun end
...
string no. 4
    strt [ pred noun ] pastp adv prod noun aux pastp prep noun end
string no. 5
    strt [ pred noun pastp ] adv prod noun aux pastp prep noun end
string no. 6 *** target ***
    strt [ pred noun pastp prep prod noun ] aux pastp prep noun end
string no. 7
    strt [ pred noun pastp adv prod noun ] aux pastp prep noun end

(C) Transform strings into sets of tuples

string no. 1
    (strt, [) ([, pred) (pred, ]) ... (noun, end)
    (strt, [, pred) ([, pred, ]) (pred, ], verb) ... (prep, noun, end)

and similarly for other strings

(D) The elements of the binary input vector represent all tuples, initialized to 0. If a tuple is
present in a string the element that represents it is changed from 0 to 1.
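Steps (C) and (D) can be sketched in Python as below. This is a minimal sketch under one assumption: the vector is indexed only by tuples actually observed in a given set of strings, whereas the text describes elements for all tuples. The function names are hypothetical, and the two strings are strings no. 1 and no. 6 from step (B).

```python
def tuples_of(string):
    """Ordered pairs and triples of adjacent elements in a tag string."""
    pairs = list(zip(string, string[1:]))
    triples = list(zip(string, string[1:], string[2:]))
    return pairs + triples

def build_index(strings):
    """Assign a vector position to every distinct tuple seen in the strings."""
    index = {}
    for string in strings:
        for t in tuples_of(string):
            index.setdefault(t, len(index))
    return index

def encode(string, index):
    """Binary vector: 1 where the tuple is present in the string, else 0."""
    vector = [0] * len(index)
    for t in tuples_of(string):
        if t in index:          # tuples absent from the index have no element
            vector[index[t]] = 1
    return vector

string_1 = "strt [ pred ] verb pastp prep prod noun aux pastp prep noun end".split()
string_6 = "strt [ pred noun pastp prep prod noun ] aux pastp prep noun end".split()
index = build_index([string_1, string_6])
vector_1 = encode(string_1, index)
```

A tuple that occurs more than once in a string, such as (pastp, prep) in string no. 1, still sets its element to 1 only once, so the vector records presence and not frequency.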
5.7 Characteristics of the data



The domain for which our parser was developed was text from technical manuals from Perkins
Engines Ltd. They were written with the explicit aim of being clear and straightforward [19].
Using this text as a basis we augmented it slightly to develop the prototype on which users try
their own text. Declarative sentences were taken unaltered from the manuals for processing:
imperative sentences, titles and captions for figures were omitted. 2% of declarative sentences were
omitted, as they fell outside the current bounds (e.g. the subject had more than 12 words). A
corpus of 351 sentences was produced: see Table 1.



    Number of sentences              351
    Average length                   17.98 words
    No. of subordinate clauses:
        In pre-subject               65
        In subject                   19
        In predicate                 136
    Co-ordinated clauses             50

Table 1: Corpus statistics. Punctuation marks are counted as words, formulae as 1 word.

This corpus (Tr-all) was divided up 4 ways (Tr 1 to Tr 4) so that nets could be trained on 
part of the corpus and tested on the rest, as shown in Table 2. In order to find the placement
of the subject boundary markers we do not need to analyse the predicate fully, so the part of 
the sentence being processed is dynamically truncated 3 words beyond the end of any postulated 
closing hypertag. The pairs and triples generated represent part of the sentence only. 

5.8 Data distribution 

Statistics on the data generated by the Perkins corpus are given in Tables 3, 4 and 5. A significant
number of tuples occur in the test set, but have not occurred in the training set, since, as
Figures 2 and 3 show, the distribution of data has a zipfian character.



    Training  number of  number of  Test  number of  number of  Ratio of
    set       sentences  strings    set   sentences  strings    testing/training
                                                                strings
    Tr-all    351        1037
    Tr 1      309        852        Ts 1  42         85         0.10
    Tr 2      292        863        Ts 2  59         174        0.20
    Tr 3      288        843        Ts 3  63         194        0.23
    Tr 4      284        825        Ts 4  67         212        0.26

Table 2: Description of training and test sets of data
    Training  number of      number of     Test  number of new  number of new
    set       pairs in       pairs in      set   pairs in       pairs in
              'yes' strings  'no' strings        'yes' strings  'no' strings
    Tr-all    162            213
    Tr 1      161            211           Ts 1  1 (1%)         2 (1%)
    Tr 2      160            210           Ts 2  2 (1%)         3 (1%)
    Tr 3      156            210           Ts 3  6 (4%)         3 (1%)
    Tr 4      149            193           Ts 4  13 (9%)        20 (10%)

Table 3: Part-of-speech pairs in training and testing sets. 'Yes' indicates correct strings, 'no'
incorrect ones



    Training  number of      number of     Test  number of new  number of new
    set       triples in     triples in    set   triples in     triples in
              'yes' strings  'no' strings        'yes' strings  'no' strings
    Tr-all    406            727
    Tr 1      400            713           Ts 1  6 (2%)         14 (2%)
    Tr 2      383            686           Ts 2  23 (6%)        41 (6%)
    Tr 3      361            642           Ts 3  45 (12%)       85 (13%)
    Tr 4      364            632           Ts 4  42 (12%)       95 (15%)

Table 4: Part-of-speech triples in training and testing data sets



    Total number of pairs in 'yes' strings      4142
    Total number of pairs in 'no' strings       8652
    Total number of triples in 'yes' strings    3736
    Total number of triples in 'no' strings     8021

Table 5: Total number of tuples in Tr-all, including repetitions



[Figure 2 plot: frequency against rank of pair (ranks up to 250), with separate curves for pairs in correct strings and in incorrect strings.]

Figure 2: Data from 351 sentences in technical manuals. Pairs are ranked by frequency of 
occurrence in correct and incorrect strings. Relationship between rank and frequency shown. 
Figure 3: Relationship between rank and frequency of triples on the same data.

5.9 Interpreting the output



For training, the set of strings generated by the training text is taken as a whole. Each string
is given a "grammaticality measure" F, where F is positive for a correct string, negative for an
incorrect one. Details follow in Section |^. In testing mode we consider separately the set of
strings generated for each sentence, and the string with the highest F measure is taken as the
correct one for that sentence. In the example in Section 5.6 string 6 is the correct one.
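The selection step in testing mode can be sketched in Python. The scoring function here is an assumption made for illustration: F is taken to be the sum of weights attached to the pairs and triples present in a string, in keeping with a single layer net over binary inputs, and the weights are invented by hand. The names `grammaticality` and `select_string` are hypothetical.

```python
def grammaticality(string, weights):
    """Score F: sum of the weights of the distinct pairs and triples present."""
    pairs = set(zip(string, string[1:]))
    triples = set(zip(string, string[1:], string[2:]))
    return sum(weights.get(t, 0.0) for t in pairs | triples)

def select_string(candidates, weights):
    """Return the candidate with the highest F, even if that F is negative."""
    return max(candidates, key=lambda s: grammaticality(s, weights))

# Hypothetical weights, chosen by hand for this illustration only.
weights = {("noun", "]", "aux"): 1.5, ("pred", "]"): -2.0, ("]", "pastp"): -1.0}
candidates = [
    "strt [ pred ] noun pastp".split(),
    "strt [ pred noun ] pastp".split(),
    "strt [ pred noun ] aux".split(),
]
best = select_string(candidates, weights)
```

Selection is relative within the set of strings for one sentence, so a sentence is still assigned a parse when none of its candidates scores above zero.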

However, on test data, there are 4 metrics for correctness that can be useful in different 
practical situations. The measure "correct-a" requires only that the hypertags are correctly 
placed: string 7 as well as string 6 is correct-a. "correct-b" requires also that all words within 
the subject are correctly tagged, "correct-c" that all words within the part of the sentence being 
processed are correctly tagged. The final measure "correct-d" records the proportion of strings 
that are in the right class. It can happen that the highest scoring string may have a negative 
F. Conversely, some incorrect strings can have a positive F without having the highest score. 
For practical purposes the measures "correct-a", "-b", and "-c" will be the significant ones. But
in analysing the performance of the networks we will be interested in "correct-d", the extent to
which the net can generalize and correctly classify strings generated by the test data. 
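Two of these measures can be sketched in Python. This is an illustrative reading of the definitions above, not the system's own code: "correct-a" is checked by comparing hypertag positions, and "correct-d" is computed as the proportion of strings whose sign of F agrees with their class. The function names are hypothetical; the two strings are strings no. 6 and no. 7 from Section 5.6.

```python
def hypertag_positions(string):
    """Positions of the subject boundary markers in a tag string."""
    return string.index("["), string.index("]")

def correct_a(selected, target):
    """True if the selected string places both hypertags as the target does."""
    return hypertag_positions(selected) == hypertag_positions(target)

def correct_d(scores, is_yes_string):
    """Proportion of strings whose sign of F matches their class."""
    hits = sum((f > 0) == yes for f, yes in zip(scores, is_yes_string))
    return hits / len(scores)

string_6 = "strt [ pred noun pastp prep prod noun ] aux pastp prep noun end".split()
string_7 = "strt [ pred noun pastp adv prod noun ] aux pastp prep noun end".split()
```

Here `correct_a(string_7, string_6)` holds although the two strings differ in one tag, which is the case described above where string 7 as well as string 6 is correct-a.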

Metrics of other systems 

Note that these measures relate to a string, not to individual elements of the string. This 
contrasts with some natural language processing systems, in which the measure of correctness 
relates to each word. For instance, automated word tagging systems typically measure success 
by the proportion of words correctly tagged. The best stochastic taggers typically quote success 
rates of 95% to 97% correct. If sentences are very approximately 20 words long on average, this 
can mean that there is an error in many sentences. 
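The arithmetic behind this remark can be made explicit with a short Python sketch. It rests on a simplifying assumption that is not made in the text: tagging errors on different words are treated as independent. The function name `p_error_free` is illustrative.

```python
def p_error_free(word_accuracy, n_words):
    """Probability that every word in a sentence is tagged correctly,
    assuming errors on different words are independent."""
    return word_accuracy ** n_words

sentence_length = 20     # approximate average length, as in the text
for word_accuracy in (0.95, 0.97):
    p = p_error_free(word_accuracy, sentence_length)
    print(f"{word_accuracy:.2f} per word -> {p:.2f} per sentence")
```

Under this assumption roughly 36% of 20-word sentences would be error free at 95% word accuracy, and roughly 54% at 97%.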



6 Using single layer networks 

6.1 Conversion to linearly separable forms 

It is always theoretically possible to solve supervised learning problems with a single layer, feed 
forward network, providing the input data is enhanced in an appropriate way. A good explanation 
is given by Pao ||2^, chapter 8]. Whether this is desirable in any particular case must be investi- 
gated. The enhancement can map the input data onto a space, usually of higher dimensionality, 
where it will be linearly separable. Widrow's valuable 1990 paper on "Perceptron, Madaline, and 
Back Propagation" page 1420] explores these approaches "which offer great simplicity and 
beauty" . 

Figure 4 illustrates the form of the Generalized Single Layer Network (GSLN). This figure is 
derived from Holden and Rayner. A non-linear transformation Φ on inputs {x} converts them 
to elements {y}, which a single layer net will then process to produce an output z (temporarily 
assuming 1 output). The Φ functions, or basis functions, can take various forms. They can be 
applied to each input node separately, or, as indicated in the figure, they can model the higher 
order effects of correlation. In our processor Φ is an ordered 'AND', described in the following 
section. A similar function is used in the grammatical inference work of Giles et al. [15]; it is 
also used in DNA sequence analysis [22]. The Φ function can be arithmetic: for instance, for 
polynomial discriminant functions the elements of the input vectors are combined as products 
[p3| , page 135]. Successful uses of this approach include the discrimination of different vowel 
sounds [E3 and the automated interpretation of telephone company data in tabular form [p^]. 

Figure 4: The Generalized Single Layer Network, GSLN, with 1 output 



Radial basis function (RBF) networks also come into the class of GSLNs. In their two stage 
training procedure the parameters governing the basis functions are determined first. Then a 
single layer net is used for the second stage of processing. Examples include Tarassenko's work 



on the analysis of EGG signals |2£] and a solution to an inverse scattering problem to determine 
particle properties [27|. 

An important characteristic of the GSLN is that processing at different layers is de-coupled. 
The first stage of training is unsupervised: the Φ functions are applied without recourse to desired 
results. In the second stage of training a supervised method is used, in which weight adjustments 
on links are related to target outputs. 

One perspective on the GSLN is given by Bishop e.g. page 89] , who characterises this type 
of system as a special case of the more general multi-layer network. Whereas in the general case 
the basis functions at the first layer are modified during the training process, in this type of system 
the basis functions are fixed independently. Widrow page 1420] puts it the other way round: 
"one can view multi-layer networks as single layer networks with trainable preprocessors...". 

6.2 The conversion function used in the parser 

We use this approach in the parser by converting a sequence of tags to a higher order set of 
adjacent pairs and triples. The example in section [5^ , stage C shows how the input elements are 
constructed. Thus, one of the elements derived from string 1 is ( [, predeterminer, verb ). 
This of course is not the same as ( predeterminer, [, verb ). The same tag can be repeated 
within a tuple: Computer Science Department maps onto ( noun, noun, noun ). 

This can be related to Figure 4. Let each x_i, for i = 1 to m, represent a tag. The Φ functions 
map these onto m^2 pairs and m^3 triples, so n = m^2 + m^3. If s is a sequence of l tags, it is 
transformed into a set S of higher order elements by Φ_p for pairs and Φ_t for triples. 

$$ s = x_1 \ldots x_i, x_{i+1}, x_{i+2} \ldots x_l $$

$$ \Phi_p(x_i) = (x_i, x_{i+1}) \quad \text{for } i = 1 \text{ to } i = l - 1 $$

$$ \Phi_t(x_i) = (x_i, x_{i+1}, x_{i+2}) \quad \text{for } i = 1 \text{ to } i = l - 2 $$

$$ S = \{\Phi_p(x_i)\} \cup \{\Phi_t(x_i)\} $$

For some of our investigations either pairs or triples were used, rather than both. 
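
The tupling step can be illustrated with a minimal sketch. The function name, its flags and the tag spellings are hypothetical; only the idea of forming ordered adjacent pairs and triples comes from the text.

```python
def tuples_from_tags(tags, use_pairs=True, use_triples=True):
    """Map a tag sequence onto the set of its ordered adjacent pairs and triples."""
    elements = set()
    if use_pairs:
        elements.update(zip(tags, tags[1:]))            # (x_i, x_i+1)
    if use_triples:
        elements.update(zip(tags, tags[1:], tags[2:]))  # (x_i, x_i+1, x_i+2)
    return elements


tags = ["strt", "[", "det", "noun", "]"]
elements = tuples_from_tags(tags)

# A repeated tag yields a tuple with repeated members.
nouns = tuples_from_tags(["noun", "noun", "noun"])
```

Because the result is a set, a tuple that occurs several times in one string is represented once, which matches a binary input vector. Order inside each tuple is preserved, so ("[", "det") and ("det", "[") are different elements.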

The Φ function represents an ordered 'AND': the higher order elements preserve sequential 
order. This function was derived using heuristic methods, but the approach was supported by 
an objective analysis of the proposed representation. We aimed to capture some of the implicit 
information in the data, to model invariances, and to represent structure. As described elsewhere 
in this paper, the choice of the tupling pre-processing function is supported by information 
theoretic analysis. It captures local, though not distant, dependencies. Using this representation 
we address simultaneously the issues of converting data to a linearly separable form, modelling 
its sequential character and capturing some of its structure. 

An approach similar to our own has been used to develop neural processors for an analysis 
of DNA sequences [22, page 166]. Initially a multi-layer network was used for one task, but 
an analysis of its operation led to the adoption of an improved representation with a simpler 
network. The input representing bases was converted to codons (tuples of adjacent bases), and 
then processed with a Perceptron. 

6.3 The practical approach 

Minsky and Papert acknowledged that single layer networks could classify linearly inseparable 
data if it was transformed to a sufficiently high order |2^, page 56], but claimed this would 
be impractical. They illustrated the point with the example of a parity tester. However, this 
example is the extreme case, where any change of a single input element will lead to a change 
in the output class. If there are n inputs, then it is necessary to tuple each element together to 
O(n). The consequent explosion of input data would make the method unusable for all but the 
smallest data sets. 
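
The smallest instance of the parity problem, with 2 inputs, shows both sides of this argument. The sketch below is an illustration, not part of the parser: it uses a textbook Perceptron with a bias weight, and the product term x1*x2 stands in for the second order element that makes the data separable.

```python
def perceptron_converges(samples, max_epochs=1000):
    """Train a Perceptron (with bias) and report whether an epoch passes with no errors."""
    dim = len(samples[0][0])
    w = [0.0] * (dim + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in samples:
            xb = (1.0,) + tuple(x)                      # prepend the bias input
            activation = sum(wi * xi for wi, xi in zip(w, xb))
            if label * activation <= 0:                 # wrong class, or on the boundary
                w = [wi + label * xi for wi, xi in zip(w, xb)]
                mistakes += 1
        if mistakes == 0:
            return True
    return False


# Parity of 2 inputs, with labels +1 (odd) and -1 (even).
raw = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]
# The same data with one added second order element.
enhanced = [((x1, x2, x1 * x2), y) for (x1, x2), y in raw]

raw_ok = perceptron_converges(raw)
enhanced_ok = perceptron_converges(enhanced)
```

With 2 inputs one extra element is enough; for parity over n inputs the required order grows with n, which is the explosion described above.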



However, in practice, real world data may be different. Shavlik et al. [30] compare single and 
multi-layer nets on 6 well known problems, and conclude "Regardless of the reason, data for 
many 'real' problems seems to consist of linearly separable categories ... Using a Perceptron as 
an initial test system is probably a good idea." This empirical approach is advocated here. Tests 
for linear separability and related problems are computationally heavy [31], [32], so we tried single 
layer networks to see whether the higher order data we use is in practice linearly separable, or 
nearly so. 

Taking data items as pairs typically produces training sets of which about 97% can be learnt 
by a single layer network; taking triples raises learnability to about 99%. Thus our data is almost 
linearly separable. 

6.4 Linear discriminants - neural and Bayesian methods 

Having established empirically that after transformation we have a linear problem, there are a 
number of different methods of linear discriminant analysis that could be used. Our single layer 
networks are convenient tools. 

We also ran our data through a Bayesian classifier, based on the model described by Duda 
and Hart ||2^, page 32]. Results were about 5% less good on test data than those from Hodyne. 
Though the parsing problem is decomposed so that good estimates can usually be made of prior 
probabilities, estimating class conditional probabilities needs further investigation. (If n possible 
parses are generated and 1 is correct, then the prior is 1/n ). Frequency counts extracted from 
the training data cannot be used as they stand as probability estimates. The Zipfian distribution 
of data can distort the probabilities, even when very large quantities are used, so that rare events 
are given too much significance. Moreover, further information on zero frequency items, though 
limited, can be extracted using an appropriate technique, as Dunning shows [10]. There are a 
number of methods of estimating probabilities on the basis of partial information which need 
investigation [33, page 55]. 



These issues can be avoided by using neural discriminators. 

6.5 Training set size 

There is a relationship between training set size and linear separability. Cover's classical work 
addressed the probability that a set of random, real valued vectors with random binary desired 



responses are linearly separable 21 1. Using his terminology and taking the term "pattern" 
to mean a training example, the critical factor is the ratio of number of patterns, H, to number 
of elements in each pattern, n. While H/n < 1, the probability P_separable = 1.0. If H/n = 2 then 
P_separable = 0.5. As H/n increases beyond this, P_separable quickly declines. 
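
Cover's counting result can be evaluated directly. The sketch below assumes the standard form of the function counting theorem for patterns in general position, with separating hyperplanes through the origin; the function name is hypothetical.

```python
from math import comb


def p_separable(num_patterns, n):
    """Probability that a random two-class labelling of num_patterns vectors
    in general position in n dimensions is linearly separable (Cover's count)."""
    if num_patterns <= 0:
        return 1.0
    # Number of separable labellings: 2 * sum_{k=0}^{n-1} C(num_patterns - 1, k)
    separable = 2 * sum(comb(num_patterns - 1, k) for k in range(n))
    return separable / 2 ** num_patterns


at_n = p_separable(10, 10)      # as many patterns as elements
at_2n = p_separable(20, 10)     # twice as many patterns as elements
at_4n = p_separable(40, 10)     # four times as many
```

For n = 10 this gives probability 1.0 with 10 patterns and exactly 0.5 with 20 patterns, and the value is already below 0.01 with 40 patterns.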

These observations are given as background information to indicate that training set size 
should be considered, but they do not apply in our case as they stand. First, our data is not 
random. Secondly, a necessary condition that the vectors are in "general position", normally 
satisfied by real valued vectors, may not hold for binary vectors [35, page 97]. 

The number of training examples, H, is a factor in determining generalization capability (see, 
for example, |3|). The probability that an error is within a certain bound increases with the 
number of training examples. Decreasing H to convert data to a linearly separable form would 
be profitless. 

The ratio of training examples to weights in our data is shown in Figure 5. Note that the 
corpus used for this preliminary working prototype is small compared to other corpora, and future 
work will use much larger ones, which could affect this ratio. 



7 Three single layer networks 

Figure 5: Relationship between number of training examples and number of weights, for the 
Perceptron and LMS nets with one output. 



7.1 Architecture 

Refer again to Figure 4, illustrating the GSLN. In this work we compare 3 networks, which all 
use the same Φ functions, described in Section 6.1. We now compare methods of processing at 
the second stage, that is the performance of 3 different single layer classifiers. The Perceptron 
and LMS net can be characterised as examples of the GSLN in Figure 4. The Hodyne model 
differs in having 2 outputs, not symmetrically connected, as in Figure 6. 



7.2 Methods of adjusting connection weights during training 

When single layer networks are used, we do not have the classic problem of "credit assignment" 
associated with multi-layer networks: the input neurons responsible for incorrect output can be 
identified, and the weights on their links adjusted. There is a choice of methods for updating 
weights. We do not have to use differentiable activation functions, as in the multi-layer Percep- 
tron. These methods can therefore be divided into two broad categories. First there are "direct 
update" methods, used in the traditional Perceptron and Hodyne-type nets, where a weight up- 
date is only invoked if a training vector falls into the wrong class. This approach is related to the 
ideas behind reinforcement learning, but there is no positive reinforcement. If the classification 
is correct weights are left alone. If the classification is incorrect then the weights are incremented 
or decremented. No error measure is needed: the weight update is a function either of the input 
vector (Perceptron), or of the existing weights (Hodyne). 

Secondly, there are "error minimization" approaches, which can also be used in multi-layer 
nets. An error measure, based on the difference between a target value and the actual value 
of the output, is used. This is frequently, as in standard back propagation, a process based on 
minimizing the mean square error, to reach the LMS error [^]. We have used a modified error 
minimization method (Section 7.5). 

Figure 6: The Hodyne network. Notes recovered from the figure: [...] often in both correct and 
incorrect strings. The node ('[' preposition) would not occur in a correct string, so it is not 
connected to the "yes" output node. Σ represents the summing function. 



7.3 The Perceptron 

The Perceptron and LMS models are both fully connected nets with a single output. Details 
of the well known Perceptron training algorithm, and of the parameters used, are in [38]. The 
output represents F, the grammaticality measure of the input string. In training, a grammatical 
string must produce a positive output, an ungrammatical string a negative output. The wrong 
output triggers a weight adjustment on all links contributing to the result. This is a function of 
the normalized input values, scaled by the learning rate. 

To speed the training process, a method of "guided initialization" sets initial random weights 
within bounds determined by the expected output. To implement this, see whether a new, 
previously unseen input element belongs to a "yes" or "no" string, corresponding to desired 
positive or negative outputs. Then set a random value between 0.0 and 0.3 for "yes", between 0.0 
and -0.3 for "no" [12]. When training is finished, weights on links from unvisited input elements 
are set to 0.0. 
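
The training scheme just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in [38] or [12]: each string is represented as a set of tuples, "normalized input" is taken to mean that each active element contributes 1/(number of active elements), and the learning rate, epoch limit and function name are hypothetical.

```python
import random


def train_perceptron(strings, learning_rate=0.1, max_epochs=1000, seed=0):
    """strings: list of (set_of_elements, is_grammatical). Returns element weights."""
    rng = random.Random(seed)
    weights = {}
    for _ in range(max_epochs):
        errors = 0
        for elements, grammatical in strings:
            if not elements:
                continue
            # Guided initialization: a new element starts in [0, 0.3] if first seen
            # in a "yes" string, in [-0.3, 0] if first seen in a "no" string.
            for e in elements:
                if e not in weights:
                    start = rng.uniform(0.0, 0.3)
                    weights[e] = start if grammatical else -start
            f = sum(weights[e] for e in elements)        # grammaticality measure F
            wrong = (f <= 0) if grammatical else (f >= 0)
            if wrong:
                errors += 1
                step = learning_rate / len(elements)     # assumed input normalization
                for e in elements:
                    weights[e] += step if grammatical else -step
        if errors == 0:
            break
    return weights


yes_string = {("det", "noun"), ("noun", "]")}
no_string = {("det", "noun"), ("]", "prep")}
weights = train_perceptron([(yes_string, True), (no_string, False)])
f_yes = sum(weights[e] for e in yes_string)
f_no = sum(weights[e] for e in no_string)
```

Elements never visited in training have no entry in the dictionary, which plays the role of a weight of 0.0 when a test string is scored with `weights.get(e, 0.0)`.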

Recall that in testing mode we consider the set of strings generated for each sentence, and 
the string with the highest F measure is taken as the correct one for that sentence. 



7.4 Hodyne 

This network, shown in Figure 6, is derived from the model introduced by Wyard and Nightingale 
[^. The 2 outputs z_0 and z_1 represent grammatical and ungrammatical, "yes" and "no", results. 
In training, a grammatical string must produce z_0 > z_1, and vice versa, else a weight adjustment 
is invoked. In testing mode, as for the Perceptron, the strings generated by each sentence are 
processed, and the string with the highest F score for that sentence is the winner. For this 
network the grammaticality measure F is z_0 - z_1. Since it is not widely known, a summary of the 
training method follows. More implementation details can be found in [12]. 

Notation 

Let each input vector have n elements y_i. Let w(t)_ij be the weight from the ith input node to the jth 
output node at time t. Let u(t)_ij be the update factor. 

δ = -1 or δ = +1 indicates whether weights should be decremented or incremented. 

Mark all links disabled. 
Initially, percentage of strings correctly classified = 0.0 
REPEAT from START1 to END1 until % strings correctly classified exceeds chosen threshold: 
START1 
    REPEAT from START2 to END2 for each string 
    START2 
        Present input, a binary vector, y_1, y_2, ..., y_n 
        Present desired output, z_0 > z_1 or vice versa 
        For any y_i > 0 enable link to desired result if it is disabled 
        Initialize weight on any new link to 1.0 
        Calculate outputs z_0 and z_1: 

            $$ z_j = \sum_{i=1}^{n} w(t)_{ij} \, y_i $$

        If actual result = desired result 
            Count string correct, leave weights alone 
        Else adjust weights on current active links: 
            if z_0 < z_1 then δ = +1 on links to z_0, δ = -1 on links to z_1, 
            and vice versa 

            $$ w(t+1)_{ij} = \left[ 1 + \frac{\delta \, w(t)_{ij}}{1 + (\delta \, w(t)_{ij})^4} \right] w(t)_{ij} $$

    END2 
    Calculate % strings correctly classified. If greater than threshold, terminate 
END1 



For the Hodyne type net the update factor u is a function of the current weights; as the 
weights increase it is asymptotic to 0, as they decrease it becomes equal to 0. 

$$ w(t+1)_{ij} = \left[ 1 \pm u(t)_{ij} \right] w(t)_{ij}, \qquad u(t)_{ij} = \frac{w(t)_{ij}}{1 + w(t)_{ij}^4} $$

This function satisfies the requirement that the weights increase monotonically and saturate 
(see Figure 7). We use the original Hodyne function with a comparatively low computational 
load. Note that in contrast to the Perceptron, where the learning rate is set at compile time, the 
effective learning rate in this method varies dynamically. The greatest changes occur when weights 
are near their initial values of 1.0; as they get larger or smaller the weight change decreases. 
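
The training summary can be turned into a short sketch. Several points are assumptions rather than the authors' code: the update factor is taken to be u = w / (1 + w^4), which is one reading of the printed rule that agrees with the behaviour described above; a tie between the two outputs is treated as an error; and the data layout and names are hypothetical.

```python
def update_factor(w):
    """Assumed Hodyne update factor: tends to 0 for very large and very small weights."""
    return w / (1.0 + w ** 4)


def outputs(weights, elements):
    """Sum the weights on enabled links from the active elements to each output."""
    return [sum(weights.get((e, j), 0.0) for e in elements) for j in (0, 1)]


def train_hodyne(strings, threshold=1.0, max_epochs=100):
    """strings: list of (set_of_elements, is_grammatical).
    Output 0 is "yes", output 1 is "no". A link is enabled once it has a weight."""
    weights = {}
    for _ in range(max_epochs):
        correct = 0
        for elements, grammatical in strings:
            desired = 0 if grammatical else 1
            for e in elements:
                weights.setdefault((e, desired), 1.0)    # enable new links at 1.0
            z = outputs(weights, elements)
            if z[desired] > z[1 - desired]:
                correct += 1                             # leave weights alone
                continue
            for e in elements:
                for j in (0, 1):
                    if (e, j) in weights:
                        delta = 1 if j == desired else -1
                        w = weights[(e, j)]
                        weights[(e, j)] = (1 + delta * update_factor(w)) * w
        if correct / len(strings) >= threshold:
            break
    return weights


data = [({"a", "c"}, False), ({"a", "b"}, True), ({"a"}, True)]
weights = train_hodyne(data)
```

In this toy run the element "a" occurs in both classes, so it is linked to both outputs, while "b" is linked only to "yes" and "c" only to "no".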

Figure 7: Relationship between old and new weights for the Hodyne net 



Hodyne's pattern of connectivity and the Prohibition Table 

Note that during training elements of a new input vector may be linked to both, either or neither 
output node. This represents the fact that a tuple can appear (i) in both a correct and incorrect 
string, (ii) in either or (iii) in neither. Tables |3| and |^ gives some information on the distribution 
of elements in training and testing sets. The data is asymmetric: any node that appears in a 
grammatical string can also appear in an ungrammatical one, but the reverse is not true. When 
training is completed links from unused inputs are enabled and their weights set to 0.0. 

Any single entry in the preliminary Prohibition Table (Section 5.4) can be omitted, and the 
pair or triple be included in the neural processing task. In this case a tuple that cannot occur in 
a grammatical string will, for Hodyne, only be connected to the "no" output node. Conversely, 
if we examine the linkage of the Hodyne net, those tuples that are only connected to the "no" 
output are candidates for inclusion in a constraint based rule. Of course there is a chance that a 
rare grammatical occurrence may show up as the size of the training set increases. 



7.5 LMS network 

The LMS model is based on the traditional method, described in "Parallel Distributed Processing" 
[37, page 322]. A bipolar activation function is used, and outputs are in the range —1 to +1. 
As with the Perceptron, the output represents the F measure for the string being processed. 
Gradient descent is used to reduce the error between desired and actual outputs. 

It has been known for many years in the numerical optimization field that the gradient descent 
technique is a poor, slow method [39]. This is now also accepted wisdom in the neural network 
community [28]. Other training methods, such as conjugate gradients, are usually preferable. For 
this experiment, however, the traditional method has been used, but some variations to speed 
up training and improve performance have been incorporated. Brady et al. [40] described some 
anomalies that can arise with the traditional LMS model. As a remedy Sontag's technique for 
interpreting the error measure is included [^]. This means that an error is only recorded if a 
vector falls into the wrong class. The target output is a threshold, and if this threshold is passed 
the vector is considered correctly classified. This contrasts with the original LMS method, in 
which an error is recorded if the target is either undershot or overshot. 
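
The difference between the two error measures can be sketched as below. This is an illustration of the thresholded error idea only: the tanh activation, the targets of +0.9 and -0.9, the learning rate and the function names are assumptions, not the parameters used in the experiments.

```python
import math


def thresholded_error(output, target):
    """Error is zero once the output has passed the target on the correct side."""
    if target > 0:
        return max(0.0, target - output)    # only undershoot of a positive target counts
    return min(0.0, target - output)        # only undershoot of a negative target counts


def lms_step(weights, inputs, target, learning_rate=0.1):
    """One gradient descent step for a single bipolar (tanh) output unit."""
    net = sum(w * x for w, x in zip(weights, inputs))
    out = math.tanh(net)
    err = thresholded_error(out, target)
    grad = err * (1.0 - out * out)          # derivative of tanh is 1 - out^2
    return [w + learning_rate * grad * x for w, x in zip(weights, inputs)]


moved = lms_step([0.0, 0.0], [1.0, 1.0], target=0.9)
unchanged = lms_step([2.0, 2.0], [1.0, 1.0], target=0.9)
```

With the original LMS error the second call would pull the weights back towards the target, since the output overshoots it; with the thresholded error the weights are left as they are.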



8 Performance 

8.1 Training times 

Since we have been developing a prototype the training threshold was taken so that training 
was fast - less than 10 seconds for the Perceptron, less than 20 seconds for the Hodyne network. 
Subject to this constraint the percentage of strings that could be trained ranged from 96.5% to 
99.0%. The Perceptron was fastest. For the LMS net training times were between 1 and 2 orders 
of magnitude greater, but the inefficient gradient descent method was used. 



8.2 Ability to generalize 

Results are given in Tables 6 to 9. This system has been developed to produce a winning string for 
each sentence, and performance can be assessed on different measures of correctness, as described 
in Section 5.9. For the purpose of investigating the function of the networks we take the strictest 
measure, correct-d in the tables, requiring that strings should be classified correctly. However, we 
can interpret the results so that in practice we get up to 100% correct for our practical application, 
since a winning string may have a negative F measure. Thus the practical measure of correctness 
can be higher than the percentage of correctly classified strings. 

Table 9 gives a summary of the results, showing how these vary with the ratio of test set size 
to training set size. This would be expected. If there is insufficient training data performance 
degrades sharply. 

The Hodyne net performed well, and this architecture was used for the prototype. Previous 
work in this field compared the performance of multi-layer Perceptrons to that of single layer 
models, and found that the multi-layer models performed less well. This is discussed in Section 10. 

Table 6: Results using Perceptron. Recall that correct-a means hypertags are correctly placed, 
correct-b that words inside subject are correctly tagged also, correct-c that all words in part of 
sentence being processed are also correctly tagged, correct-d that the string is in the right class. 

Table 7: Results for Hodyne net on same training and testing data as for the Perceptron (Table 6). 

Table 8: Results for LMS net on the same data. Compare with Tables 6 and 7. 



Ratio test set /   Perceptron       Hodyne           LMS              Hodyne 
training set       % test strings   % test strings   % test strings   % hypertags 
                   correct          correct          correct          correct 

0.10               89.4             92.9             92.9             100 
0.20               88.5             91.4             88.5             100 
0.23               80.9             89.2             85.1             100 
0.26               77.4             84.4             82.5             95.5 



Table 9: Summary of results culled from Tables ^, ^, |^ and ^, showing performance on 4 different 
training and test sets. 






9 Understanding the operation of the network 



9.1 The importance of negative information 

Consider the following unremarkable sentence, and some of the strings it generates: 



the          determiner 
directions   noun 
given        past-part-verb 
below        preposition or adverb 
must         auxiliary verb 
be           auxiliary verb 
carefully    adverb 
followed     past-part-verb 
.            endpoint 

string no. 1: 

strt [ det noun ] pastp adv aux aux adv pastp endp 

string no. 2: 

strt [ det noun pastp ] prep aux aux adv pastp endp 

string no. 3: *** target *** 

strt [ det noun pastp prep ] aux aux adv pastp endp 

In the LOB corpus the pair (preposition, modal-verb), which represents the words (below, 
must)*, has a frequency of less than 0.01%, if it occurs at all. So when a sentence like this 
is processed in testing mode the particular construction may well not have occurred in any 
training string. However, in the candidate strings that are generated wrong placements should 
be associated with stronger negative weights somewhere in the string. For example, string 2 maps 
onto: 

* [ The directions given ] below must be carefully followed. 

The proposed subject would not be associated with strong negative weights. However, the 
following pairs and triples include at least one that is strongly negative, such as ( ], preposition), 
an element in the negative strings generated in the training set. The correct placement, as in 
string 3, would be the least bad, the one with the highest F score. 

By training on negative as well as positive examples we increase the likelihood that in testing 
mode a previously unseen structure can be correctly processed. In this way the probability of 
correctly processing rare constructions is increased. 



9.2 Relationship between frequency of occurrence and weight 

After training we see that the distribution of weights in Hodyne and Perceptron nets have certain 
characteristics in common. In both cases there is a trend for links on the least common input 
tuples to be more heavily weighted than the more common: see Figures 8 and 9. 

This characteristic distribution of weights can be understood when we examine the process 
by which the weights are adapted. Since we are processing negative as well as positive examples 
in the training stage, the movement of weights differs from that found with positive probabilities 
alone. Some very common tuples will appear frequently in both correct and incorrect strings. 
Consider a pair such as (start-of-sentence, open-subject). This will often occur at the start 
of both grammatical and ungrammatical strings. The result of the learning process is to push 
down the weights on the links to both the "yes" and the "no" output nodes. 

* Modal verbs are included in the class of auxiliary verb in this tagset. 

A significant number of nodes represent those tuples that have never occurred in a grammatical 
string. A few nodes represent tuples that have only occurred in grammatical strings 
(Tables I and I). 

Figure 8: Weights plotted against frequency of input node occurring, for training corpus Tr 1 on 
Hodyne 

Figure 9: Weights plotted against frequency of input node occurring, for training corpus Tr 1 on 
Perceptron 



The relationship between frequency of occurrence and level of weight accompanies the decision 
to assess the correctness of a whole string, rather than the status of each element. Strings that are 
slightly wrong will include tuples that occur both in grammatical and ungrammatical sequences. 
As a consequence we see that the classification decision can depend more on infrequently occurring 
tuples. In particular, tuples that usually only occur in an ungrammatical string can have a 
significant influence on the classification task. 

9.3 Direct update versus error minimization 

The use of direct update rather than error minimization methods may also have an effect on 
generalization. The traditional LMS measure can lead to situations where most input vectors are 
close to the target, while a few, or a single one, are distant. This may be desirable when noisy 
data is processed, but not for our linguistic data, where we want precise fitting. We want to 
capture information from isolated examples. We want to classify strings that are ungrammatical 
in a single element as well as those that are grossly ungrammatical. 



10 Conclusion 

The original objective of this work was to see whether the pattern matching capabilities of neural 
networks could be mobilised for Natural Language Processing tasks. The working partial parser 
demonstrates that they can be. 

Multi-layer Perceptrons were tried in the past, but it was found that single layer networks 
were more effective, provided that the data was appropriately converted to a higher order form. 
Some arguments against this approach centre on the lack of a principled method to find the 
pre-processing Φ function. But though the methods of finding the initial Φ function are based 
on intuition, a close initial examination of the data can mean that this intuition is founded on 
an understanding of the data characteristics. In the case of the linguistic data support for the 
non-linear conversion function has come from information theoretic tools. Though the setting 
of parameters is not data driven at the micro level, as in a supervised learning environment, 
the functions are chosen to capture some of the structure and invariances of the data. The 
development of neural processors for an analysis of DNA sequences also illustrates this [22]. 

The analysis in the previous section illustrates the transparency of single layer networks, 
and indicates why they are such convenient tools. Compared to multi-layer Perceptrons, the 
parameters of the processor are more amenable to being interpreted. The Hodyne net in particular 
lends itself to further linguistic analysis. Furthermore, this approach has the advantage of fast 
two stage training. The speed of training, measured in seconds, shows how quickly single layer 
networks can fix their weights. Training times are hardly an issue. 

A significant question of generalization ability is seen to relate to the ratio of testing to training 
data set size. Current work on generalization has focused on principled methods of determining 
training set size to ensure that the probability of generalization error is less than a given bound. 
Having implemented a preliminary prototype our work will continue with much larger corpora. 

In the development of this technology we return to the fundamental question: how do we 
reconcile computational feasibility with empirical relevance? How do we match what can be done 
to what needs to be done? Firstly, in addressing the parsing problem we start by decomposing 
the problem into computationally more tractable subtasks. Then we investigate the data and 
devise a representation that enables the simplest effective processors to be used. The guiding 
principles are to attack complexity by decomposing the problem, and to adopt a reductionist 
approach in designing the neural processors. 